Zurich - 27 & 28 June 2022

Multiple imputation (recap)

Multiple Imputation

  1. Incomplete data
  2. Generate multiple copies of the same dataset, each with differently imputed values
  3. Analyze each imputed dataset
  4. Pool the results of the analyses into the final study result

Notes on multiple imputation

  • Takes imputation uncertainty into account
  • A method to improve the main analysis results (so not to complete or fill in the data)
  • Make sure that the imputation model
    • Holds the relevant variables to deal with missings
    • Is compatible with the analysis model

Full Information Maximum Likelihood

FIML

  • Uses all observed data, so also partly observed rows, to estimate model parameters.
  • Analysis and missing data handling are done at once
  • Only involves variables used in the analysis
  • An algorithm estimates the parameters that are most likely to have generated the observed data.

How does it work

  • Iterative algorithm, such as Expectation Maximization (EM) or a Newton-type optimizer, that searches for the maximum likelihood (ML) estimates
    • In each iteration the parameters are evaluated and updated, until the fit with the data no longer improves.
    • The fit is optimal when the likelihood function is maximized, i.e. when the observed data are most probable given the model parameters.
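
The idea can be sketched in a few lines of base R. The function below is an illustrative assumption, not the implementation used by lavaan: it evaluates a multivariate normal log-likelihood case by case, so that each row contributes only through the variables it has observed, and a general-purpose optimizer (`optim()`) then updates the means and covariances iteratively. The name `fiml_negloglik` and the Cholesky parameterization are chosen here for illustration.

```r
# Negative casewise log-likelihood of a multivariate normal model.
# par = c(means, lower-triangular Cholesky factor of the covariance matrix)
fiml_negloglik <- function(par, data) {
  p  <- ncol(data)
  mu <- par[seq_len(p)]
  L  <- matrix(0, p, p)
  L[lower.tri(L, diag = TRUE)] <- par[-seq_len(p)]
  Sigma <- L %*% t(L)                    # covariance matrix built from its factor
  ll <- 0
  for (i in seq_len(nrow(data))) {
    obs <- !is.na(data[i, ])
    if (!any(obs)) next                  # a fully missing row carries no information
    y <- data[i, obs] - mu[obs]
    S <- Sigma[obs, obs, drop = FALSE]   # only the observed part of the model
    ll <- ll - 0.5 * (sum(obs) * log(2 * pi) +
                      as.numeric(determinant(S)$modulus) +
                      sum(y * solve(S, y)))
  }
  -ll
}

dat   <- as.matrix(airquality[, c("Ozone", "Solar.R")])
start <- c(colMeans(dat, na.rm = TRUE),
           sd(dat[, 1], na.rm = TRUE), 0, sd(dat[, 2], na.rm = TRUE))
est   <- optim(start, fiml_negloglik, data = dat, method = "BFGS")
est$par[1:2]   # estimated means, based on partly observed rows as well
```

Rows with only one of the two variables observed still enter the likelihood through that variable, which is what distinguishes this approach from listwise deletion.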

Note on FIML use

  • Can be used to impute single values
    • Be aware that this leads to similar disadvantages as single regression imputation
    • Not advised to use this method to obtain a fully observed dataset
  • Best used for model estimation
    • All observed information is used to estimate the best fitting parameters.
    • Estimation is done by an iterative algorithm
    • Uses similar information as multiple imputation (if the model holds the same information), so results are comparable.

FIML in lavaan descriptives

  • In R use lavaan package with missing = "fiml".
  • Specify all variables with ~~ for the variance estimation
  • Use meanstructure = TRUE to obtain means of variables.
library(lavaan)
model <- '
  #variance
  Ozone ~~ Ozone
  Solar.R ~~ Solar.R
  Wind ~~ Wind
  Temp ~~ Temp
  '
fit <- sem(model, data = airquality, missing = "fiml", meanstructure = TRUE)

FIML descriptives

> 
> Parameter Estimates:
> 
>   Standard errors                             Standard
>   Information                                 Observed
>   Observed information based on                Hessian
> 
> Intercepts:
>                    Estimate  Std.Err  z-value  P(>|z|)
>     Ozone            42.129    3.050   13.815    0.000
>     Solar.R         185.932    7.428   25.032    0.000
>     Wind              9.958    0.284   35.076    0.000
>     Temp             77.882    0.763  102.112    0.000
> 
> Variances:
>                    Estimate  Std.Err  z-value  P(>|z|)
>     Ozone          1078.817  141.655    7.616    0.000
>     Solar.R        8054.991  942.768    8.544    0.000
>     Wind             12.330    1.410    8.746    0.000
>     Temp             89.006   10.176    8.746    0.000

FIML in lavaan correlation

  • In R use lavaan package with missing = "fiml".
  • Specify each pair of variables with ~~ for the covariance estimation
model <- '
  #correlation
  Ozone ~~ Solar.R + Wind + Temp
  Solar.R ~~ Wind + Temp
  Wind ~~ Temp
  '
fit <- sem(model, data = airquality, missing = "fiml")

FIML descriptives

  • Use standardized = TRUE in the summary function to obtain correlation.
> 
> Parameter Estimates:
> 
>   Standard errors                             Standard
>   Information                                 Observed
>   Observed information based on                Hessian
> 
> Covariances:
>                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
>   Ozone ~~                                                              
>     Solar.R         942.530  266.602    3.535    0.000  942.530    0.324
>     Wind            -64.636   11.033   -5.858    0.000  -64.636   -0.570
>     Temp            209.564   31.267    6.702    0.000  209.564    0.687
>   Solar.R ~~                                                            
>     Wind            -17.335   26.211   -0.661    0.508  -17.335   -0.055
>     Temp            238.073   74.272    3.205    0.001  238.073    0.281
>   Wind ~~                                                               
>     Temp            -15.172    2.946   -5.151    0.000  -15.172   -0.458
> 
> Intercepts:
>                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
>     Ozone            41.871    2.782   15.048    0.000   41.871    1.296
>     Solar.R         184.847    7.428   24.884    0.000  184.847    2.055
>     Wind              9.958    0.284   35.076    0.000    9.958    2.836
>     Temp             77.882    0.763  102.112    0.000   77.882    8.255
> 
> Variances:
>                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
>     Ozone          1044.019  129.627    8.054    0.000 1044.019    1.000
>     Solar.R        8090.702  950.667    8.511    0.000 8090.702    1.000
>     Wind             12.330    1.410    8.746    0.000   12.330    1.000
>     Temp             89.006   10.176    8.746    0.000   89.006    1.000

Output explained

Output is quite large and gives a lot of information at once.

  • Intercepts: FIML means
    • ~1 in output tables
  • Variances: FIML variances
    • variable x ~~ variable x in output tables
  • Covariances: FIML covariances, with the correlations in the Std.all column (because standardized = TRUE)
    • variable x ~~ variable y in output tables

Use fmi = TRUE in the summary() function to get fraction of missing information.

fmi = fraction of missing information: the proportion of the sampling variance of an estimate that is due to the missing data, i.e. the impact of the missing data on the precision of the estimates.
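
In the multiple imputation setting an analogous quantity follows from Rubin's rules. The sketch below is a large-sample illustration with assumed inputs: the function name and the example numbers are hypothetical, and lavaan derives its FMI in a different way. It shows how the share of the total variance that stems from the missing data (lambda, which approximates the FMI when the number of imputations is large) relates to the relative increase in variance.

```r
# Rubin's rules for one parameter, given per-imputation estimates and variances
rubin_missing_info <- function(est, var_within) {
  m     <- length(est)
  u_bar <- mean(var_within)            # within-imputation variance
  b     <- var(est)                    # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b     # total variance of the pooled estimate
  list(lambda = (1 + 1 / m) * b / total,   # proportion of variance due to missing data
       riv    = (1 + 1 / m) * b / u_bar)   # relative increase in variance
}

# hypothetical slopes and squared standard errors from m = 5 imputed datasets
res <- rubin_missing_info(est        = c(0.110, 0.118, 0.121, 0.109, 0.115),
                          var_within = rep(0.03^2, 5))
res$lambda
res$riv
```

The two quantities are linked through lambda = riv / (1 + riv), so a larger relative increase in variance always means a larger fraction of missing information.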

Missing data patterns

Missing data patterns in the data

inspect(fit, 'patterns') 
>      Ozone Solr.R Wind Temp
> [1,]     1      1    1    1
> [2,]     0      1    1    1
> [3,]     1      0    1    1
> [4,]     0      0    1    1

Missing data proportions

A symmetric matrix where each element contains the proportion of observed data points for the corresponding pair of observed variables.

inspect(fit, 'coverage')
>         Ozone Solr.R Wind  Temp 
> Ozone   0.758                   
> Solar.R 0.725 0.954             
> Wind    0.758 0.954  1.000      
> Temp    0.758 0.954  1.000 1.000
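
The coverage matrix can also be reproduced by hand, which makes clear what the proportions mean. A minimal base R sketch (the object names are chosen here for illustration):

```r
vars     <- c("Ozone", "Solar.R", "Wind", "Temp")
observed <- (!is.na(airquality[, vars])) * 1   # 1 where a value is observed, 0 otherwise

# crossprod() counts, per pair of variables, the rows where both are observed
coverage <- crossprod(observed) / nrow(observed)
round(coverage, 3)
```

The diagonal holds the proportion of observed values per variable, and the off-diagonal elements hold the proportion of rows in which both variables are observed.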

Linear regression analysis

code

model <- '
  Ozone ~ Solar.R 
  '
fit_fiml <- sem(model, data = airquality, missing = "fiml")
summary(fit_fiml, header = FALSE, fmi = TRUE)

Linear regression analysis

output

> 
> Parameter Estimates:
> 
>   Standard errors                             Standard
>   Information                                 Observed
>   Observed information based on                Hessian
> 
> Regressions:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>   Ozone ~                                                      
>     Solar.R           0.127    0.032    3.915    0.000    0.223
> 
> Intercepts:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone            18.599    6.687    2.781    0.005    0.218
> 
> Variances:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone           964.164  129.421    7.450    0.000    0.240

Compare with MI

library(mice)
library(dplyr)
imp_pmm <- mice(airquality %>% select(Ozone, Solar.R), method = "pmm", print = FALSE)
fit_mi <- with(imp_pmm, lm(Ozone ~ Solar.R))
combi <- pool(fit_mi)
summary(combi)
>          term   estimate  std.error statistic      df      p.value
> 1 (Intercept) 18.6855578 6.07367372  3.076484 60.9087 0.0031360003
> 2     Solar.R  0.1143117 0.02964086  3.856558 57.5065 0.0002928367

SEM with auxiliary variables

code

  • Use auxiliary variables to improve missing data handling
  • Add the auxiliary variables as covariances in the model
model_aux <- '
  Ozone ~ Solar.R 
  Solar.R ~~ Wind + Temp + Month + Day
  Ozone ~~ Wind + Temp + Month + Day
  Wind ~~ Temp + Month + Day
  Temp ~~ Month + Day
  Month ~~ Day
  '
auxfit1 <- sem(model = model_aux, missing = "fiml", data = airquality)

SEM with auxiliary variables

output

> 
> Parameter Estimates:
> 
>   Standard errors                             Standard
>   Information                                 Observed
>   Observed information based on                Hessian
> 
> Regressions:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>   Ozone ~                                                      
>     Solar.R           0.112    0.030    3.701    0.000    0.158
> 
> Covariances:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>   Solar.R ~~                                                   
>     Wind            -17.069   26.123   -0.653    0.514    0.046
>     Temp            232.954   73.899    3.152    0.002    0.077
>     Month            -7.321   10.557   -0.693    0.488    0.056
>     Day            -119.301   66.532   -1.793    0.073    0.051
>  .Ozone ~~                                                     
>     Wind            -63.332   10.615   -5.966    0.000    0.095
>     Temp            183.490   28.455    6.448    0.000    0.102
>     Month             8.240    3.735    2.206    0.027    0.090
>     Day               6.540   23.317    0.280    0.779    0.134
>   Wind ~~                                                      
>     Temp            -15.172    2.946   -5.151    0.000   -0.000
>     Month            -0.884    0.407   -2.171    0.030   -0.000
>     Day               0.843    2.509    0.336    0.737   -0.000
>   Temp ~~                                                      
>     Month             5.607    1.168    4.799    0.000   -0.000
>     Day             -10.886    6.796   -1.602    0.109    0.000
>   Month ~~                                                     
>     Day              -0.099    1.009   -0.098    0.922   -0.000
> 
> Intercepts:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone            21.819    6.194    3.523    0.000    0.152
>     Solar.R         185.534    7.410   25.038    0.000    0.042
>     Wind              9.958    0.284   35.076    0.000    0.000
>     Temp             77.882    0.763  102.112    0.000    0.000
>     Month             6.993    0.114   61.269    0.000    0.000
>     Day              15.804    0.714   22.125    0.000    0.000
> 
> Variances:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone           943.445  119.142    7.919    0.000    0.180
>     Solar.R        8050.793  941.698    8.549    0.000    0.045
>     Wind             12.330    1.410    8.746    0.000   -0.000
>     Temp             89.006   10.176    8.746    0.000   -0.000
>     Month             1.993    0.228    8.746    0.000    0.000
>     Day              78.066    8.925    8.746    0.000    0.000

SEM with auxiliary variables

code

  • Easily add auxiliary variables with semTools package
library(semTools)
aux.vars <- c("Wind", "Temp", "Month", "Day")
auxfit <- sem.auxiliary(model=model, missing = "fiml", aux=aux.vars, fixed.x=FALSE, data=airquality)

SEM with auxiliary variables

output

> 
> Parameter Estimates:
> 
>   Standard errors                             Standard
>   Information                                 Observed
>   Observed information based on                Hessian
> 
> Regressions:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>   Ozone ~                                                      
>     Solar.R           0.112    0.030    3.701    0.000    0.158
> 
> Covariances:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>   Wind ~~                                                      
>     Temp            -15.172    2.946   -5.151    0.000    0.000
>     Month            -0.884    0.407   -2.171    0.030   -0.000
>     Day               0.843    2.509    0.336    0.737    0.000
>   Temp ~~                                                      
>     Month             5.607    1.168    4.799    0.000    0.000
>     Day             -10.886    6.796   -1.602    0.109    0.000
>   Month ~~                                                     
>     Day              -0.099    1.009   -0.098    0.922   -0.000
>   Wind ~~                                                      
>    .Ozone           -63.332   10.615   -5.966    0.000    0.095
>   Temp ~~                                                      
>    .Ozone           183.490   28.455    6.448    0.000    0.102
>   Month ~~                                                     
>    .Ozone             8.240    3.735    2.206    0.027    0.090
>   Day ~~                                                       
>    .Ozone             6.540   23.317    0.280    0.779    0.134
>   Wind ~~                                                      
>     Solar.R         -17.069   26.123   -0.653    0.514    0.046
>   Temp ~~                                                      
>     Solar.R         232.954   73.899    3.152    0.002    0.077
>   Month ~~                                                     
>     Solar.R          -7.321   10.557   -0.693    0.488    0.056
>   Day ~~                                                       
>     Solar.R        -119.301   66.532   -1.793    0.073    0.051
> 
> Intercepts:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone            21.819    6.194    3.523    0.000    0.152
>     Solar.R         185.534    7.410   25.038    0.000    0.042
>     Wind              9.958    0.284   35.076    0.000    0.000
>     Temp             77.882    0.763  102.112    0.000    0.000
>     Month             6.993    0.114   61.269    0.000    0.000
>     Day              15.804    0.714   22.125    0.000    0.000
> 
> Variances:
>                    Estimate  Std.Err  z-value  P(>|z|)      FMI
>    .Ozone           943.445  119.142    7.919    0.000    0.180
>     Solar.R        8050.793  941.698    8.549    0.000    0.045
>     Wind             12.330    1.410    8.746    0.000   -0.000
>     Temp             89.006   10.176    8.746    0.000    0.000
>     Month             1.993    0.228    8.746    0.000   -0.000
>     Day              78.066    8.925    8.746    0.000    0.000

Compare with MI

imp_pmm <- mice(airquality, method = "pmm", print = FALSE)
fit_mi <- with(imp_pmm, lm(Ozone ~ Solar.R))
combi <- pool(fit_mi)
summary(combi)          
>          term   estimate std.error statistic       df     p.value
> 1 (Intercept) 21.1824955 6.1777787  3.428821 72.42304 0.001003733
> 2     Solar.R  0.1117497 0.0295637  3.779962 79.33684 0.000302260

Missing data in questionnaires

Multi-item questionnaires

  • Constructs measured indirectly, through items.
  • Item score can be continuous, dichotomous or measured on Likert scale.
  • The summary score of the items represents the construct

Summarizing item scores

  • Classical test theory (CTT): sum score or mean score
  • Latent variable obtained via Item Response model (e.g. Rasch model or 2 parameter logistic model)
  • Latent variable measured in a structural equation model

Missing data in multi-item questionnaires

  • Missing item level scores
  • Missing full questionnaire
>   item1 item2 item3 item4 item5 Total.score
> 1     1     0     0     0     1           2
> 2    NA     1     0    NA     1          NA
> 3    NA    NA    NA    NA    NA          NA

Both lead to a missing total score

Advice in user manuals

  • User manuals often have a statement about dealing with missing item scores
  • The advised method is not always evidence-based
  • Examples:
    • SF-36: “Items that are left blank (missing data) are not taken into account when calculating the scale scores. Hence, scale scores represent the average for all items in the scale that the respondent answered.” (Ware & Sherbourne, 1992)
    • SDQ: “The SDQ comprises of 5 items for 5 subscales. For each of the 5 scales the score can range from 0 to 10 if all items were completed. These scores can be scaled up pro-rata if at least 3 items were completed, e.g. a score of 4 based on 3 completed items can be scaled up to a score of 7 (6.67 rounded up) for 5 items.” (Goodman, 2010)
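
The pro-rata rule quoted for the SDQ can be written as a small function. This is an illustrative sketch of the quoted rule only; the function name and arguments are chosen here and it is not official scoring software.

```r
# Pro-rata scale score: scale the sum of the answered items up to the full
# number of items, provided that at least `min_items` were completed.
prorata_score <- function(items, min_items = 3) {
  n_answered <- sum(!is.na(items))
  if (n_answered < min_items) return(NA_real_)
  round(sum(items, na.rm = TRUE) * length(items) / n_answered)
}

prorata_score(c(2, 1, 1, NA, NA))    # 4 on 3 items -> 6.67 -> 7
prorata_score(c(2, NA, NA, NA, NA))  # too few items answered -> NA
```

Apart from the rounding, this equals the mean of the answered items multiplied by the number of items, so it is a form of person mean imputation.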

Person mean imputation

  • Single imputation of mean over the observed items (within person)
  • Same as average over the available items
  • Best ad hoc single imputation method available
  • Disadvantages of single imputation: no imputation uncertainty
  • Works best when the correlation between items is high

Example of person mean imputation

>   Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 TSQ1
> 1   NA   NA   NA    5    5   NA
> 2    1    2    5    5    1   14
> 3   NA   NA   NA    3    4   NA
> 4    5    5    5    5    5   25
> 5    3    5    4    1    3   16
library(dplyr)
x %>% 
  # compute the average over the available items (AAI) per person
  mutate(AAI = rowMeans(select(., Q1i1, Q1i2, Q1i3, Q1i4, Q1i5), na.rm = TRUE)) %>%
  # replace each missing item score with the person's AAI
  mutate_at(.vars = vars(Q1i1:Q1i5),
            .funs = list(~ ifelse(is.na(.), AAI, .))) %>%
  # recompute the total score from the completed items
  mutate(TSQ1 = rowSums(select(., Q1i1, Q1i2, Q1i3, Q1i4, Q1i5)))
>   Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 TSQ1 AAI
> 1  5.0  5.0  5.0    5    5 25.0 5.0
> 2  1.0  2.0  5.0    5    1 14.0 2.8
> 3  3.5  3.5  3.5    3    4 17.5 3.5
> 4  5.0  5.0  5.0    5    5 25.0 5.0
> 5  3.0  5.0  4.0    1    3 16.0 3.2

Note on person mean imputation

  • Can get unstable when:
    • Many items are missing.
    • Correlations between items are low.
  • Single imputation method: no missing data uncertainty.

Multiple imputation in multi-item questionnaires

  • Imputing item scores versus imputing total scores
  • Item scores can hold valuable information
  • Total scores are often used in analyses

Advised strategy for imputation

  • Majority of item scores observed: use item level imputation
  • Only a few or no item scores observed: use total score imputation
  • Combine both strategies

Challenges in multi-item questionnaire missings

  • Imputation model can grow large when all items are used
  • When the total score is used in analyses, the total score should be used as predictor for other variables

Solutions

Imputation model can grow large when all items are used

  • Isolate item imputation per questionnaire or subscale via the predictor matrix
  • Especially relevant in longitudinal data

When the total score is used in analyses, the total score should be used as predictor for other variables

  • Update total score after each iteration by using passive imputation.
  • Use total score as predictor for other covariates via predictorMatrix.

Illustration item and total scores

  • Isolate item imputation per questionnaire or subscale.
    • Use predictor matrix: rows indicate imputed variable, columns are predictors.


Data set with two questionnaires of 5 items each, and 1 covariate.

  • For Q1:
>      Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 Q2i1 Q2i2 Q2i3 Q2i4 Q2i5 TSQ1 TSQ2 cov1
> Q1i1    0    1    1    1    1    0    0    0    0    0    1    1    1
> Q1i2    1    0    1    1    1    0    0    0    0    0    1    1    1
> Q1i3    1    1    0    1    1    0    0    0    0    0    1    1    1
> Q1i4    1    1    1    0    1    0    0    0    0    0    1    1    1
> Q1i5    1    1    1    1    0    0    0    0    0    0    1    1    1

Illustration item and total scores

  • Isolate item imputation per questionnaire or subscale



  • For Q2:
>      Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 Q2i1 Q2i2 Q2i3 Q2i4 Q2i5 TSQ1 TSQ2 cov1
> Q2i1    0    0    0    0    0    0    1    1    1    1    1    1    1
> Q2i2    0    0    0    0    0    1    0    1    1    1    1    1    1
> Q2i3    0    0    0    0    0    1    1    0    1    1    1    1    1
> Q2i4    0    0    0    0    0    1    1    1    0    1    1    1    1
> Q2i5    0    0    0    0    0    1    1    1    1    0    1    1    1

Illustration item and total scores

  • Total score cannot be used as predictor for its own items.
>      Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 Q2i1 Q2i2 Q2i3 Q2i4 Q2i5 TSQ1 TSQ2 cov1
> Q1i1    0    1    1    1    1    0    0    0    0    0    0    1    1
> Q1i2    1    0    1    1    1    0    0    0    0    0    0    1    1
> Q1i3    1    1    0    1    1    0    0    0    0    0    0    1    1
> Q1i4    1    1    1    0    1    0    0    0    0    0    0    1    1
> Q1i5    1    1    1    1    0    0    0    0    0    0    0    1    1
> Q2i1    0    0    0    0    0    0    1    1    1    1    1    0    1
> Q2i2    0    0    0    0    0    1    0    1    1    1    1    0    1
> Q2i3    0    0    0    0    0    1    1    0    1    1    1    0    1
> Q2i4    0    0    0    0    0    1    1    1    0    1    1    0    1
> Q2i5    0    0    0    0    0    1    1    1    1    0    1    0    1

Illustration item and total scores

  • Item scores cannot be used as predictors together with their own total score.
>      Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 Q2i1 Q2i2 Q2i3 Q2i4 Q2i5 TSQ1 TSQ2 cov1
> TSQ1    0    0    0    0    0    0    0    0    0    0    0    1    1
> TSQ2    0    0    0    0    0    0    0    0    0    0    1    0    1
> cov1    0    0    0    0    0    0    0    0    0    0    1    1    0

Illustration item and total scores

  • Total score should be used as predictor for other variables when used in the analysis model.
>      TSQ1 TSQ2 cov1
> TSQ1    0    1    1
> TSQ2    1    0    1
> cov1    1    1    0

Illustration item and total scores

  • Full predictor matrix
>      Q1i1 Q1i2 Q1i3 Q1i4 Q1i5 Q2i1 Q2i2 Q2i3 Q2i4 Q2i5 TSQ1 TSQ2 cov1
> Q1i1    0    1    1    1    1    0    0    0    0    0    0    1    1
> Q1i2    1    0    1    1    1    0    0    0    0    0    0    1    1
> Q1i3    1    1    0    1    1    0    0    0    0    0    0    1    1
> Q1i4    1    1    1    0    1    0    0    0    0    0    0    1    1
> Q1i5    1    1    1    1    0    0    0    0    0    0    0    1    1
> Q2i1    0    0    0    0    0    0    1    1    1    1    1    0    1
> Q2i2    0    0    0    0    0    1    0    1    1    1    1    0    1
> Q2i3    0    0    0    0    0    1    1    0    1    1    1    0    1
> Q2i4    0    0    0    0    0    1    1    1    0    1    1    0    1
> Q2i5    0    0    0    0    0    1    1    1    1    0    1    0    1
> TSQ1    0    0    0    0    0    0    0    0    0    0    0    1    1
> TSQ2    0    0    0    0    0    0    0    0    0    0    1    0    1
> cov1    0    0    0    0    0    0    0    0    0    0    1    1    0
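
The full predictor matrix does not have to be typed by hand. A minimal base R sketch that builds it from the rules above, assuming the variable names of the illustration (the object names are chosen here):

```r
q1   <- paste0("Q1i", 1:5)
q2   <- paste0("Q2i", 1:5)
vars <- c(q1, q2, "TSQ1", "TSQ2", "cov1")

# rows = variable to be imputed, columns = predictors; start with all zeros
pred <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))

# items: predicted by the other items of their own questionnaire,
# the total score of the other questionnaire and the covariate
pred[q1, c(q1, "TSQ2", "cov1")] <- 1
pred[q2, c(q2, "TSQ1", "cov1")] <- 1

# total scores and covariate: predicted by each other, not by the items
pred[c("TSQ1", "TSQ2", "cov1"), c("TSQ1", "TSQ2", "cov1")] <- 1

diag(pred) <- 0   # a variable never predicts itself
pred
```

A matrix like this can be passed to `mice()` through its `predictorMatrix` argument.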

Passive imputation total score

  • Total score is used as predictor, but not directly imputed.
  • Item scores are imputed.
  • Total scores re-calculated from the imputed item scores: Passive imputation

Passive imputation process

During each iteration for Q1:

1. Impute item scores using items from its own questionnaire, total score(s) from other questionnaires and covariate(s).

  • \(Q1i1_i = Q1i2_i \cdot \beta_1 + Q1i3_i \cdot \beta_2 + Q1i4_i \cdot \beta_3 + Q1i5_i \cdot \beta_4 + TSQ2_i \cdot \beta_5 + cov1_i \cdot \beta_6\)
  • \(Q1i2_i = Q1i1_i \cdot \beta_1 + Q1i3_i \cdot \beta_2 + Q1i4_i \cdot \beta_3 + Q1i5_i \cdot \beta_4 + TSQ2_i \cdot \beta_5 + cov1_i \cdot \beta_6\)
  • etc.

2. Total score is re-calculated using the imputed item scores.

  • \(TSQ1_i = Q1i1_i + Q1i2_i + Q1i3_i + Q1i4_i + Q1i5_i\)

3. Updated total score is used as predictor for covariate(s) and items of other questionnaires in next iteration.

  • \(Q2i1_i = Q2i2_i \cdot \beta_1 + Q2i3_i \cdot \beta_2 + Q2i4_i \cdot \beta_3 + Q2i5_i \cdot \beta_4 + TSQ1_i \cdot \beta_5 + cov1_i \cdot \beta_6\)

Note: the subscript \(_i\) indicates the imputed value from the previous iteration.

Passive imputation code

  • Change the imputation method for the total scores.
>  TSQ1  TSQ2 
> "pmm" "pmm"
  • The total score is calculated as the sum of the items.
>                                   TSQ1                                   TSQ2 
> "~I(Q1i1 + Q1i2 + Q1i3 + Q1i4 + Q1i5)" "~I(Q2i1 + Q2i2 + Q2i3 + Q2i4 + Q2i5)"
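
A minimal sketch of how such a method vector could be set up, assuming the variable names of the illustration. The vector is built in base R here so that the example is self-contained; in practice it would typically start from the default method vector of mice (for example via `make.method()`) and be passed to `mice()` through its `method` argument.

```r
q1   <- paste0("Q1i", 1:5)
q2   <- paste0("Q2i", 1:5)
vars <- c(q1, q2, "TSQ1", "TSQ2", "cov1")

# default: predictive mean matching for every variable
meth <- setNames(rep("pmm", length(vars)), vars)

# passive imputation: total scores are re-computed from the imputed items
meth["TSQ1"] <- paste0("~I(", paste(q1, collapse = " + "), ")")
meth["TSQ2"] <- paste0("~I(", paste(q2, collapse = " + "), ")")

meth[c("TSQ1", "TSQ2")]
```

Because the total scores are derived from the items, their rows in the predictor matrix play no role in their own update; they only act as predictors for other variables.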

Missing data in IRT models

  • Some estimation methods for IRT models deal with missing values under the MAR assumption
  • For example maximum likelihood estimation

Missing data in SEM models

  • Full Information Maximum Likelihood estimation
  • In R: the lavaan package, which can mimic Mplus
  • When total scores are analyzed, item information can be included as auxiliary variables in the SEM model.

Missing item scores

Summary

  • Item level information may need a different strategy in order to use all available information
  • When there are missing values at item level only, use a strategy that involves the item scores.
  • When mostly the full questionnaire is missing, the missing data can be dealt with at the total score level.

Longitudinal missing data

Repeated measurements data

  • In longitudinal studies people are monitored for a longer time period.
  • For example RCT with follow-up.
    • Baseline
    • Post-treatment
    • Follow-up
  • Intensive longitudinal data: sensor measurements, daily measures.

Challenges

  • Multiple variables at each time-point
  • Repeated measurements are correlated within persons: multilevel structure
  • Measurement time-points sometimes do not line up

Data structure

  • Wide data format
    • One row per person, repeated measurements in the columns
  • Long data format
    • Multiple rows per person, one column as time-point indicator

Wide data format

>    id group          T0          T1        T2
> 1   1     0 -0.89273971  0.62595050 1.6787242
> 2   2     1  1.09466683  1.66393217 2.6314272
> 3   3     0  0.53677979  1.99526687 3.0587021
> 4   4     1 -0.08057365  1.35645288 1.5600102
> 5   5     0 -0.24587013 -0.06647697 0.5832191
> 6   6     1  0.60444222  1.45718160 2.7469386
> 7   7     0  0.00417498  0.21136080 1.4365193
> 8   8     1  1.94695848  2.76345288 3.6306766
> 9   9     0 -2.22879834 -0.75850320 0.8149130
> 10 10     1 -0.23124489  1.28434128 2.1040148
> 11 11     0  1.53801265  2.04635754 2.6932853
> 12 12     1  0.65401059  1.46172168 2.0400761
> 13 13     0 -0.29052400 -0.94582269 1.3305414
> 14 14     1  0.30578755  1.01441996 2.6523174
> 15 15     0 -0.70110697  0.67496886 1.4098061

Long data format

>    id group time     outcome
> 1   1     0   T0 -0.89273971
> 2   1     0   T1  0.62595050
> 3   1     0   T2  1.67872419
> 4   2     1   T0  1.09466683
> 5   2     1   T1  1.66393217
> 6   2     1   T2  2.63142721
> 7   3     0   T0  0.53677979
> 8   3     0   T1  1.99526687
> 9   3     0   T2  3.05870208
> 10  4     1   T0 -0.08057365
> 11  4     1   T1  1.35645288
> 12  4     1   T2  1.56001019
> 13  5     0   T0 -0.24587013
> 14  5     0   T1 -0.06647697
> 15  5     0   T2  0.58321912
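
Switching between the two formats can be done in base R with `reshape()`. A minimal sketch on the first three persons of the wide data above (values rounded; tidyr offers comparable functions):

```r
wide <- data.frame(id    = 1:3,
                   group = c(0, 1, 0),
                   T0    = c(-0.893, 1.095, 0.537),
                   T1    = c( 0.626, 1.664, 1.995),
                   T2    = c( 1.679, 2.631, 3.059))

# wide -> long: one row per person per time-point
long <- reshape(wide, direction = "long",
                varying = c("T0", "T1", "T2"), v.names = "outcome",
                timevar = "time", times = c("T0", "T1", "T2"),
                idvar = "id")
long <- long[order(long$id, long$time), ]
rownames(long) <- NULL
long
```

The long version has three rows per person, with `time` as the time-point indicator and `outcome` holding the repeated measurements.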

Missing data imputation

  • Wide imputation
  • Long imputation: multilevel imputation

Wide data imputation

  • Similar to multiple imputation on cross-sectional data
  • Number of variables in imputation model may be a challenge
  • Imputation model should be compatible with the analysis model

Example wide imputation

Data set for two groups with measurements at three time-points for:

  • outcome
  • covariate 1
  • covariate 2
  • covariate 3

In total: 13 variables in wide imputation (4 measures at 3 time-points, plus the group variable). This number can grow large when more variables are measured at each time-point.

Practical solution for wide imputation

  • Leave out covariates from other time-points as predictors in imputation model
  • Adapt the predictorMatrix

Example predictorMatrix for cov1

>         outcome_T0 outcome_T1 outcome_T2 cov1_T0 cov1_T1 cov1_T2 cov2_T0
> cov1_T0          1          1          1       0       1       1       1
> cov1_T1          1          1          1       1       0       1       0
> cov1_T2          1          1          1       1       1       0       0
>         cov2_T1 cov2_T2 cov3_T0 cov3_T1 cov3_T2
> cov1_T0       0       0       1       0       0
> cov1_T1       1       0       0       1       0
> cov1_T2       0       1       0       0       1
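
These rows follow a simple rule, which can be generated instead of typed. The sketch below is an assumption about how the rule extends to the rows that are not shown (for example the outcome rows), and it leaves out the group variable for brevity; the object names are chosen here.

```r
measures <- c("outcome", "cov1", "cov2", "cov3")
times    <- c("T0", "T1", "T2")
grid     <- expand.grid(time = times, measure = measures, stringsAsFactors = FALSE)
vars     <- paste(grid$measure, grid$time, sep = "_")

pred <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    same_measure <- grid$measure[i] == grid$measure[j]
    same_time    <- grid$time[i] == grid$time[j]
    is_outcome   <- grid$measure[j] == "outcome"
    # keep: the outcome at every time-point, the same measure at other
    # time-points, and other measures taken at the same time-point
    if (i != j && (is_outcome || same_measure || same_time)) pred[i, j] <- 1
  }
}
pred[c("cov1_T0", "cov1_T1", "cov1_T2"), ]
```

Each covariate now has 7 predictors instead of 11, and this number no longer grows with the number of time-points of the other covariates.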

Imputation of long data

  • Multiple rows per subject in the data
  • Multilevel imputation
  • Account for the clustering of measurements within subjects

Methods for multilevel imputation

  • miceadds package contains many additional methods
  • broom.mixed package enables the pool function for mixed models fitted on the mice output.

Long imputation predictorMatrix

  • Use -2 for cluster variable id in predictorMatrix (random intercept)
>         id group time outcome cov1 cov2 cov3
> id       0     1    1       1    1    1    1
> group   -2     0    1       1    1    1    1
> time    -2     1    0       1    1    1    1
> outcome -2     1    1       0    1    1    1
> cov1    -2     1    1       1    0    1    1
> cov2    -2     1    1       1    1    0    1
> cov3    -2     1    1       1    1    1    0

Long imputation method

  • Change method to 2l.pmm
>       id    group     time  outcome     cov1     cov2     cov3 
>       ""       ""       "" "2l.pmm" "2l.pmm"       ""       ""
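
A minimal base R sketch of how the predictor matrix and method vector shown above could be constructed, assuming the variable names of the example. Both objects would then be passed to `mice()` through `predictorMatrix` and `method`, with the `2l.pmm` method available once miceadds is loaded.

```r
vars <- c("id", "group", "time", "outcome", "cov1", "cov2", "cov3")

# start with every variable predicting every other variable
pred <- matrix(1, length(vars), length(vars), dimnames = list(vars, vars))
diag(pred) <- 0

# -2 flags id as the cluster variable in the rows of all other variables
pred[vars != "id", "id"] <- -2

# two-level predictive mean matching only for the incomplete variables
meth <- setNames(rep("", length(vars)), vars)
meth[c("outcome", "cov1")] <- "2l.pmm"

pred
meth
```

Variables with an empty method are not imputed, so only the rows for outcome and cov1 are actually used.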

FIML for missing outcomes

  • Longitudinal analysis model with FIML estimation: multilevel model
  • Missing data in the dependent variable are handled by the maximum likelihood estimation
    • \(Outcome_{ij} = \beta_0 + \beta_1 \cdot cov2_{ij} + b_{0j} + \epsilon_{ij}\)
    • \(b_{0j}\) is random intercept at subject level
  • Estimation method optimizes model parameters using all observed data of dependent variable

Missing covariate data are handled by listwise deletion
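
A minimal sketch of such a random intercept model on simulated long data. The data, the effect sizes and the choice of the nlme package are assumptions made for this illustration; `lme4::lmer()` could be used in a similar way.

```r
library(nlme)

set.seed(2022)
n_id <- 60
long <- data.frame(id   = rep(seq_len(n_id), each = 3),
                   time = rep(0:2, times = n_id),
                   cov2 = rep(rbinom(n_id, 1, 0.5), each = 3))
b0 <- rnorm(n_id, sd = 0.6)                    # random intercept per subject
long$outcome <- 1 + 0.1 * long$cov2 + b0[long$id] + rnorm(nrow(long))

# drop-out: remove part of the outcome values at the later time-points
drop <- long$time > 0 & runif(nrow(long)) < 0.25
long$outcome[drop] <- NA

# rows with a missing outcome are left out, but every subject still
# contributes the measurements that were observed
fit <- lme(outcome ~ cov2, random = ~ 1 | id, data = long,
           na.action = na.omit, method = "ML")
fixef(fit)
c(rows_used = length(residuals(fit)), rows_observed = sum(!is.na(long$outcome)))
```

No subject is removed as a whole because of a missing outcome at one time-point, which is the sense in which the likelihood uses all observed data of the dependent variable.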

Example no imputation

  • Of the covariates, only cov1 has missing data
  • Outcome variable has missing observations (drop-out)
>     id group time cov2 cov3 cov1 outcome   
> 111  1     1    1    1    1    1       1  0
> 20   1     1    1    1    1    1       0  1
> 18   1     1    1    1    1    0       1  1
> 1    1     1    1    1    1    0       0  2
>      0     0    0    0    0   19      21 40
> # A tibble: 4 x 8
>   effect   group    term         estimate std.error statistic conf.low conf.high
>   <chr>    <chr>    <chr>           <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
> 1 fixed    <NA>     (Intercept)    1.06       0.130     8.10     0.801     1.31 
> 2 fixed    <NA>     cov2           0.0190     0.115     0.166   -0.206     0.244
> 3 ran_pars id       sd__(Interc~   0.589     NA        NA       NA        NA    
> 4 ran_pars Residual sd__Observa~   1.11      NA        NA       NA        NA

Example with imputation

  • Impute the outcome (and cov1) as comparison.
>          term   estimate std.error statistic        df      p.value
> 1 (Intercept) 1.06633608 0.1325362 8.0456241 111.34485 1.018519e-12
> 2        cov2 0.01953625 0.1163277 0.1679415  86.62942 8.670208e-01

Longitudinal missing data

Summary

  • Missing data in dependent variable are handled in multilevel analysis model
  • Regular multiple imputation of wide data when possible
  • Multilevel multiple imputation to deal with missing covariates when design is too complicated for wide imputation